Web-crawling reliability
Author
Abstract
In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selective. I also report the results of a large-scale experimental simulation of Web crawling that illustrates the effects of different crawling policies on data collection. It is concluded that the reliability of Web crawling as a data collection technique is improved by fuller reporting of relevant crawling policies.

Introduction

The Web graph model of the World Wide Web (or just the Web) is now firmly established (Broder et al., 2000). The model provides both an analytic framework for studying the Web and a mental model for discussing it. Crawling has been described as "the process of identifying and fetching Web pages, usually for indexing, by traversing the Web link graph from a set of starting points" (p. 4). That is, Web crawlers move from node (or document) to node by means of the hyperlinks that each node contains and that define the edges of the Web graph. This Web-crawling technique, coupled with document indexing and retrieval procedures, has proved particularly successful at supporting Web search engine systems, which make relevant Web document content accessible to users. Web crawling is also the data collection technique that supports structural and informetric analyses of the Web graph (e.g., Meghabghab, 2001). In particular, one form of analysis of the Web graph considers the inlinks and outlinks of nodes, which represent document references and citations. In practice, crawlers operate under constraints and compromises that affect data collection; the set of operational characteristics of a Web crawler is described here as the Web-crawling policy.

This paper investigates the practice of Web crawling and contrasts content crawling with link crawling. Content crawling is used here to refer to Web crawling whose objective is to discover and index the content of the documents that make up the Web, typically in support of Web search engines. Content crawling, for example, ignores duplicate documents and thus differs from link crawling, which does not. In addition, I present the findings of an experiment designed to discover the effects of different Web-crawling policies. To investigate the …
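To make the distinction concrete, the following is a minimal sketch (not the crawler used in this study) of a breadth-first crawl over a small in-memory "toy" Web under two policies: one that ignores duplicate documents, as a content crawler would, and one that records the links of every fetched page, as a link crawler would. The toy pages, the example URLs, and the hash-based duplicate test are illustrative assumptions.

```python
# Sketch only: breadth-first crawl of a toy Web graph under two policies.
from collections import deque
import hashlib

# Toy Web: URL -> (page text, outlinks). Pages B and D carry duplicate text.
TOY_WEB = {
    "http://a.example/": ("home page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("shared text", ["http://d.example/"]),
    "http://c.example/": ("unique text", ["http://d.example/"]),
    "http://d.example/": ("shared text", ["http://a.example/"]),
}

def crawl(seeds, skip_duplicates):
    """Traverse the link graph from the seed URLs.

    skip_duplicates=True approximates a content-crawling policy: a page whose
    text has already been seen is neither indexed nor mined for links.
    skip_duplicates=False approximates a link-crawling policy: every fetched
    page contributes its outlinks to the collected link data.
    """
    frontier = deque(seeds)
    visited, seen_hashes, links = set(), set(), []
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in TOY_WEB:
            continue
        visited.add(url)
        text, outlinks = TOY_WEB[url]                   # stands in for an HTTP fetch
        digest = hashlib.sha1(text.encode()).hexdigest()
        if skip_duplicates and digest in seen_hashes:
            continue                                    # duplicate document: ignored
        seen_hashes.add(digest)
        for target in outlinks:
            links.append((url, target))                 # an edge of the Web graph
            frontier.append(target)
    return links

if __name__ == "__main__":
    print("content policy:", len(crawl(["http://a.example/"], True)), "links")
    print("link policy:   ", len(crawl(["http://a.example/"], False)), "links")
```

On this toy graph the content policy collects four links and the link policy five, because the duplicate page still contributes its outlink when duplicates are not ignored; the gap between the two policies is exactly the kind of effect the reliability question above concerns.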
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download only domain-specific web pages, and an unfocused approach often produces undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
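As an illustration of the general idea of prioritizing the URL queue (this is not the algorithm proposed in that paper), the sketch below keeps the crawl frontier in a priority queue ordered by a simple keyword-overlap relevance score, so that the most promising URL is fetched first. The topic keywords and example URLs are assumptions made for the example.

```python
# Sketch only: a prioritized URL frontier for a focused crawler.
import heapq

TOPIC_KEYWORDS = {"crawler", "web", "search"}

def relevance(anchor_text):
    """Score a link by keyword overlap between its anchor text and the topic."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / (len(TOPIC_KEYWORDS) or 1)

class Frontier:
    """URL queue that always yields the most promising URL next."""
    def __init__(self):
        self._heap, self._queued = [], set()

    def add(self, url, anchor_text):
        if url not in self._queued:
            # heapq is a min-heap, so push the negated score.
            heapq.heappush(self._heap, (-relevance(anchor_text), url))
            self._queued.add(url)

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, -score

frontier = Frontier()
frontier.add("http://example.org/search-engines", "web search crawler basics")
frontier.add("http://example.org/cooking", "pasta recipes")
print(frontier.pop())   # the on-topic URL comes out first
```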
Intelligent Web Crawling
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of m...
Focused Crawling Techniques
The need for increasingly specific replies to web search queries has prompted researchers to work on focused web crawling techniques for web spiders. A variety of lexical and link-based approaches to focused web crawling are introduced in the paper, highlighting important aspects of each. General Terms: Focused Web Crawling, Algorithms, Crawling Techniques.
An extended model for effective migrating parallel web crawling with domain specific crawling
The Internet is large and has grown enormously; search engines are the tools for Web site navigation and search. Search engines maintain indices for web documents and provide search facilities by continuously downloading Web pages for processing. This process of downloading web pages is known as web crawling. In this paper we propose the architecture for Effective Migrating Parall...
An Extended Model for Effective Migrating Parallel Web Crawling with Domain Specific and Incremental Crawling
The Internet is large and has grown enormously; search engines are the tools for Web site navigation and search. Search engines maintain indices for web documents and provide search facilities by continuously downloading Web pages for processing. This process of downloading web pages is known as web crawling. In this paper we propose the architecture for Effective Migrating Parall...
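The sketch below shows, under stated assumptions, one ingredient such parallel crawling designs build on: partitioning the crawl across workers by domain, so that each worker stays within "its" hosts. It is not the migrating-crawler architecture proposed in these papers; the worker count, toy URLs, and hash-based assignment rule are illustrative assumptions.

```python
# Sketch only: domain-partitioned parallel crawling with worker threads.
from urllib.parse import urlparse
from queue import Queue
import threading

URLS = [
    "http://a.example/1", "http://a.example/2",
    "http://b.example/1", "http://c.example/1",
]
NUM_WORKERS = 2
queues = [Queue() for _ in range(NUM_WORKERS)]

def assign(url):
    """Route a URL to a worker by its host, keeping each domain on one worker."""
    host = urlparse(url).netloc
    return hash(host) % NUM_WORKERS

def worker(idx):
    while True:
        url = queues[idx].get()
        if url is None:                          # sentinel: this worker is done
            break
        print(f"worker {idx} fetches {url}")     # stands in for download + parse

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for url in URLS:
    queues[assign(url)].put(url)
for q in queues:
    q.put(None)
for t in threads:
    t.join()
```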
Journal: JASIST
Volume: 55, Issue: -
Pages: -
Year of publication: 2004